We propose a generic Markov Chain Monte Carlo (MCMC) algorithm to speed up computations for datasets with many observations. A key feature of our approach is the use of the highly efficient difference estimator from the survey sampling literature to estimate the log-likelihood accurately using only a small fraction of the data. Our algorithm improves on the $O(n)$ complexity of regular MCMC by operating over local data clusters instead of the full sample when computing the likelihood. The likelihood estimate is used in a pseudo-marginal framework to sample from a perturbed posterior which is within $O(m^{-1/2})$ of the true posterior, where $m$ is the subsample size. The method is applied to a logistic regression model to predict firm bankruptcy for a large dataset. We document a significant speedup in comparison to standard MCMC on the full dataset.
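To make the idea of a difference estimator concrete, here is a minimal sketch for logistic regression. It is not the implementation behind the abstract: the control variate used here, a first-order Taylor expansion of each log-likelihood term in the parameter around a fixed reference value `theta_ref`, is an assumption chosen for brevity and stands in for the cluster-based approximation the abstract mentions. All names (`DifferenceEstimator`, `log_lik_term`, `grad_term`) and the simulated data are hypothetical. The estimator adds the cheaply available sum of control variates to a scaled subsample sum of the residuals, so each call touches only $m$ observations after a one-off $O(n)$ pass.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_lik_term(theta, x, y):
    # Log-likelihood contribution l_i(theta) of one observation.
    z = dot(theta, x)
    return y * z - softplus(z)


def grad_term(theta, x, y):
    # Gradient of l_i(theta) with respect to theta.
    z = dot(theta, x)
    p = math.exp(z - softplus(z))  # sigmoid(z), computed stably
    return [(y - p) * xi for xi in x]


class DifferenceEstimator:
    """Subsampling estimate of sum_i l_i(theta) using control variates
    q_i(theta) = l_i(theta_ref) + grad l_i(theta_ref) . (theta - theta_ref).
    The choice of q_i is an illustrative assumption."""

    def __init__(self, X, Y, theta_ref):
        self.X, self.Y, self.theta_ref = X, Y, list(theta_ref)
        # One-off O(n) pass: sums needed to evaluate sum_i q_i(theta) cheaply.
        self.sum_l_ref = sum(log_lik_term(theta_ref, x, y) for x, y in zip(X, Y))
        grads = [grad_term(theta_ref, x, y) for x, y in zip(X, Y)]
        self.sum_g_ref = [sum(col) for col in zip(*grads)]

    def estimate(self, theta, m, rng):
        n = len(self.X)
        d = [t - r for t, r in zip(theta, self.theta_ref)]
        sum_q = self.sum_l_ref + dot(self.sum_g_ref, d)  # exact sum of q_i
        resid = 0.0
        for _ in range(m):
            i = rng.randrange(n)  # simple random sampling with replacement
            x, y = self.X[i], self.Y[i]
            q_i = log_lik_term(self.theta_ref, x, y) + dot(
                grad_term(self.theta_ref, x, y), d
            )
            resid += log_lik_term(theta, x, y) - q_i
        # Unbiased for the full-data log-likelihood at theta.
        return sum_q + (n / m) * resid


# Simulated data, for illustration only.
rng = random.Random(0)
theta_true = [0.5, -1.0, 0.25]
n = 5000
X = [[rng.gauss(0.0, 1.0) for _ in theta_true] for _ in range(n)]
Y = [
    1.0 if rng.random() < math.exp(dot(theta_true, x) - softplus(dot(theta_true, x)))
    else 0.0
    for x in X
]

estimator = DifferenceEstimator(X, Y, theta_ref=theta_true)
theta = [t + 0.01 for t in theta_true]
approx = estimator.estimate(theta, m=200, rng=rng)
exact = sum(log_lik_term(theta, x, y) for x, y in zip(X, Y))
print(approx, exact)
```

The closer the control variates track the individual log-likelihood terms, the smaller the residuals and the lower the variance of the estimate for a given $m$. Plugging the estimate into a pseudo-marginal sampler, as the abstract describes, is a separate step that this sketch does not cover.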